The three of us are huge cricket fans. We come from India, where cricket is not just a game but a religion. When we learned that we needed to do a project using Python, all of us instantly connected with this topic and liked the idea of turning the team project into a fun one. Two reasons come to mind when we think about the project's importance. First, how cool would it be if we could actually predict the outcomes of cricket matches and see whether a team can win against all odds. Second, the analysis could help the advertising industry, which relies heavily on match results to get the desired returns from its marketing campaigns.
Four years is a long time for a team to go from zero to hero. This is exactly what happened with the current World Cup champions, England. In the 2015 World Cup, England were eliminated in the group stage; come 2019, they were crowned champions of the world in their own backyard.
We wish to analyse the data and draw conclusions from different types of data visualizations to help us understand how a particular team or player performs under certain conditions. The conditions include: the venue of the tournament (that is, whether the team is playing at home or away), the kind of pitches they play on, and how many tosses the team wins or loses. We also take into consideration the number of matches a team has won while chasing or defending a total.
Questions of Interest
Wickets: The batting team has 10 wickets in hand. All 11 players on a side can bat, and the innings ends once 10 of them are dismissed.
Also refer to https://www.youtube.com/watch?v=g-beFHld19c, a short YouTube video that explains the game of cricket in just 3.5 minutes.
We have used web scraping to collect data from stats.espncricinfo.com, which holds the data of past cricket matches and players. We scraped data for the past few years and saved it into multiple CSV files. We required data for only the top ten teams that played the World Cup, but the website had data for every team that plays cricket. To clean the data down to the desired ten teams, we first removed the rows for the other, unwanted teams. Second, since the World Cup was played in England, we filtered the data to the matches played only in England. We also merged the yearly results files (2015 to 2019) into one CSV file: because the file structure was the same, we could then work on a single file and get the stats for all the years by indexing just one file. Finally, we gathered information about the English grounds where the 2019 World Cup was played, as well as data on the batsmen and bowlers who played the World Cup.
The code below shows the scraping process. The files were scraped from the website and stored directly as CSV files on the local system.
import pandas as pd

# Each entry: (stats page URL, string pd.read_html uses to pick the right table, output CSV path).
# pd.read_html fetches each page itself, so no separate requests.get call is needed.
pages = [
    ('http://stats.espncricinfo.com/ci/engine/records/team/series_results.html?class=2;id=201;type=decade', 'Winner', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ SeriesResults.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/team/results_summary.html?class=2;id=201;type=decade', 'Team', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ MatchResults.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2015;type=year', 'Team 1', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2015.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2016;type=year', 'Team 1', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2016.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2017;type=year', 'Team 1', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2017.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2018;type=year', 'Team 1', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2018.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2019;type=year', 'Team 1', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2019.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/team/highest_innings_totals.html?class=2;id=201;type=decade', 'Team', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ HighestTotals.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/team/lowest_innings_totals.html?class=2;id=201;type=decade', 'Team', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ LowestTotals.csv'),
    ('http://stats.espncricinfo.com/ci/engine/records/fielding/most_catches_career.html?class=2;id=201;type=decade', 'Player', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ MostCatches.csv'),
    ('http://stats.espncricinfo.com/ci/engine/stats/index.html?class=2;spanmax2=12+Dec+2019;spanmin2=12+Dec+2012;spanval2=span;template=results;type=aggregate;view=ground', 'Mat', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ GroundNumbers.csv'),
    ('http://stats.espncricinfo.com/ci/content/records/283878.html', 'Mat', r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Match_Results.csv'),
]
for url, match, path in pages:
    # The first table matching `match` on each page is the stats table we want
    pd.read_html(url, match=match)[0].to_csv(path)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
highest_totals = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ HighestTotals.csv')
# Highest 50 totals made by any team in ODI
highest_totals
# displaying teams that have scored these highest totals
highest_total_country = highest_totals.groupby('Team').count()['Score']
highest_total_country
# importing the Results dataset of different years
Results_2015 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2015.csv')
Results_2016 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2016.csv')
Results_2017 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2017.csv')
Results_2018 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2018.csv')
Results_2019 = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ Results2019.csv')
# Concatenating these dataframes
Results = pd.concat([Results_2015, Results_2016, Results_2017, Results_2018, Results_2019])
Results.drop(["Unnamed: 0"], axis = 1, inplace = True)
Results = Results.reset_index(drop = True)
We merged the five results files from five different years into one consolidated dataframe. This makes it easier to access all the results at once, without referring back and forth between the years.
# Keep only the matches where both sides are among the ten World Cup teams
teams = ['India', 'England', 'Pakistan', 'Sri Lanka', 'Australia', 'South Africa',
         'New Zealand', 'Bangladesh', 'Afghanistan', 'West Indies']
Results.drop(Results[~Results['Team 1'].isin(teams)].index, inplace=True)
Results.drop(Results[~Results['Team 2'].isin(teams)].index, inplace=True)
Results
Above, we have the final dataframe with the results of all the teams that played in the World Cup. These are essentially the top 10 teams that qualified for the World Cup based on ICC (International Cricket Council) data. The International Cricket Council is the global governing body of cricket and conducts international matches.
# Getting the teams with each of their number of wins
Results_winner = Results.groupby('Winner').count()['Ground']
Results_winner
ground_averages = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Ground_Averages.csv')
ground_averages.head()
The dataframe above contains the basic information about each ground.
# Defining the grounds of England
World_Cup_Grounds=["Lord's, London - England", "The Rose Bowl, Southampton - England", "Trent Bridge, Nottingham - England", "Sophia Gardens, Cardiff - England", "Kennington Oval, London - England", "Edgbaston, Birmingham - England", "Old Trafford, Manchester - England", "Riverside Ground, Chester-le-Street - England", "Headingley, Leeds - England", "County Ground, Bristol - England", "County Ground, Taunton - England"]
World_Cup_Grounds
Worldcup_Ground_Stats = []
England_Grounds = ground_averages.Ground
for grounds in England_Grounds:
    for venues in World_Cup_Grounds:
        if grounds in venues:
            Worldcup_Ground_Stats.append(venues)
Worldcup_Ground_Stats
WorldCup_Grounds_Stats = ground_averages[ground_averages.Ground.isin(Worldcup_Ground_Stats)]
WorldCup_Grounds_Stats
match_results = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Match_Results.csv')
match_results.head()
# Keep only the ten World Cup teams
match_results.rename(columns = {'Country':'Team'}, inplace = True)
teams = ['India', 'England', 'Pakistan', 'Sri Lanka', 'Australia', 'South Africa',
         'New Zealand', 'Bangladesh', 'Afghanistan', 'West Indies']
match_results.drop(match_results[~match_results['Team'].isin(teams)].index, inplace=True)
match_results
# creating a win percentage column, since different teams have played different numbers of matches
match_results['Win Percent'] = (match_results['Won']/match_results['Mat'])*100
match_results
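As a quick sanity check of the win-percent formula above (with hypothetical numbers):

```python
# Hypothetical record: 70 wins from 100 matches
won, mat = 70, 100
win_percent = (won / mat) * 100  # -> 70.0
```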
sns.barplot(x = "Team", y = "Win Percent", data = match_results).set_title("Win Percent of each Country")
plt.xlabel("Team")
plt.ylabel("Win Percent")
plt.xticks(rotation = 90)
ODI_Scores_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ODI_Match_Results.csv')
WC_venue_pitches = ["The Oval, London","Trent Bridge, Nottingham","Sophia Gardens, Cardiff","County Ground, Bristol","Rose Bowl, Southampton","County Ground, Taunton","Old Trafford, Manchester","Edgbaston, Birmingham","Headingley, Leeds","Lord's, London","Riverside Ground, Chester-le-Street"]
#Total Grounds
WC_Ground_Stats = []
ODI_Grounds = ODI_Scores_Data.Ground
for i in ODI_Grounds:
    for j in WC_venue_pitches:
        if i in j:
            WC_Ground_Stats.append((i, j))
Ground_names = dict(set(WC_Ground_Stats))
def Full_Ground_names(value):
    return Ground_names[value]
Ground_names
#Let's gather the data of all ODI's in these WC Venues
WC_England_History = ODI_Scores_Data[ODI_Scores_Data.Ground.isin([Ground[0] for Ground in WC_Ground_Stats])].copy()
WC_England_History["Ground"] = WC_England_History.Ground.apply(Full_Ground_names)
WC_England_History.head()
winnings = WC_England_History[["Country","Result"]].copy()
winnings["count"] = 1
Ground_Results_Per_Team = winnings.groupby(["Country","Result"]).aggregate(["sum"])
Ground_Results_Per_Team = Ground_Results_Per_Team.groupby(level=0).apply(lambda x:100 * x / float(x.sum())).reset_index()
Ground_Results_Per_Team.columns = ["Country","Result","Count"]
Ground_Results_Per_Team.head()
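The groupby-then-normalize step above converts each team's result counts into percentages of that team's total. The same idea can be sketched with `transform` on a tiny hypothetical frame (the `toy` data below is made up for illustration):

```python
import pandas as pd

# Hypothetical mini-version of the `winnings` frame
toy = pd.DataFrame({
    "Country": ["India", "India", "India", "England"],
    "Result":  ["won", "lost", "won", "won"],
    "count":   [1, 1, 1, 1],
})
# Sum counts per (Country, Result), then divide by each country's total
per_team = toy.groupby(["Country", "Result"], as_index=False)["count"].sum()
per_team["Count"] = 100 * per_team["count"] / per_team.groupby("Country")["count"].transform("sum")
```

India's two wins out of three results become roughly 66.7%, and England's single win becomes 100%.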
import plotly.graph_objects as go
fig = go.Figure(data=[
    go.Bar(name=result,
           x=Ground_Results_Per_Team[Ground_Results_Per_Team.Result == result].Country,
           y=Ground_Results_Per_Team[Ground_Results_Per_Team.Result == result].Count)
    for result in ['aban', 'lost', 'n/r', 'won', '-', 'tied']
])
# Change the bar mode
fig.update_layout(barmode='group', title = 'Details of Matches - Country Wise',xaxis_title="Country",yaxis_title="Count Of Matches",)
fig.show()
Inning_Wins = WC_England_History[WC_England_History.Result == "won"].Bat.value_counts(normalize = True).reset_index()
sns.barplot(x = "index", y = "Bat", data = Inning_Wins).set_title("Wins by Innings")
plt.xlabel("Innings")
plt.ylabel("Win Percentage")
Note: When one team bats and the other bowls, it is termed an innings. After 50 overs, the teams exchange roles: the second team now bats and the first one bowls. This is called the second innings.
Pitch_Innings = WC_England_History[WC_England_History.Result == "won"][["Bat","Ground"]]
Pitch_Innings["Count"] = 1
Pitch_Innings = Pitch_Innings.groupby(["Ground","Bat"]).sum()
Pitch_Innings = Pitch_Innings.groupby(level=0).apply(lambda x:100 * x / float(x.sum())).reset_index()
Pitch_Innings.columns = ["Ground", "Bat","Wins"]
Pitch_Innings.head( 5 )
plt.figure(figsize=(15,8))
g = sns.lineplot( x = "Ground", y = "Wins", hue = "Bat", style="Bat", markers=True, data = Pitch_Innings)
plt.xticks(rotation = 60)
plt.title('Win - Batting 1st or 2nd')
Inning_Wins = WC_England_History[WC_England_History.Result == "won"].Toss.value_counts(normalize = True).reset_index()
sns.barplot(x = "index", y = "Toss", data = Inning_Wins).set_title("Wins by Toss")
plt.xlabel("Toss")
plt.ylabel("Win Percentage")
Pitch_Innings = WC_England_History[WC_England_History.Result == "won"][["Toss","Ground"]]
Pitch_Innings["Count"] = 1
Pitch_Innings = Pitch_Innings.groupby(["Ground","Toss"]).sum()
Pitch_Innings = Pitch_Innings.groupby(level=0).apply(lambda x:100 * x / float(x.sum())).reset_index()
Pitch_Innings.columns = ["Ground", "Toss","Wins"]
Pitch_Innings.head( 5 )
plt.figure(figsize=(15,8))
sns.barplot(x = "Ground", y = "Wins", hue = "Toss", data = Pitch_Innings).set_title("Results - Based on Toss")
plt.xticks(rotation = 60)
ODI_Scores = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ODI_Match_Totals.csv')
#Let's gather the data of all ODI's in these WC Venues
WC_England_Scores = ODI_Scores[ODI_Scores.Ground.isin([Ground[0] for Ground in WC_Ground_Stats])].copy()
WC_England_Scores["Ground"] = WC_England_Scores.Ground.apply(Full_Ground_names)
WC_England_Scores.head()
# Drop entries whose Score contains "D" (e.g. D/L-affected results), then keep
# only the runs part of scores recorded as "runs/wickets" (e.g. "321/6")
WC_England_Scores = WC_England_Scores[~WC_England_Scores.Score.str.contains("D")].copy()
Scores = [int(item[0]) for item in WC_England_Scores.Score.str.split("/")]
WC_England_Scores["Score_without_wickets"] = Scores
Stadium_Scores = WC_England_Scores[["Score_without_wickets","Ground"]]
Stadium_Scores = Stadium_Scores[Stadium_Scores.Score_without_wickets > 50]
plt.figure(figsize=(12,6))
plt.xticks(rotation = 60)
sns.violinplot(x = "Ground", y = "Score_without_wickets",data = Stadium_Scores).set_title("Scores vs Pitches")
plt.ylabel("Scores")
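The score parsing used above relies on ESPNcricinfo's `runs/wickets` convention (a total with no slash typically means the side was bowled out). A minimal sketch with made-up score strings:

```python
# Made-up score strings in the "runs/wickets" style used on the stats pages
raw_scores = ["321/6", "329", "250/8"]
runs = [int(score.split("/")[0]) for score in raw_scores]  # -> [321, 329, 250]
```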
Grounds = WC_England_Scores.Ground.unique()
WC_Teams = WC_England_Scores.Country.unique()
Ground_Winnings = {}
for Ground in Grounds:
    Ground_Winnings.update({Ground : {}})
    for Team in WC_Teams:
        Country_Ground_Record = WC_England_Scores[(WC_England_Scores.Country == Team) &
                                                  (WC_England_Scores.Ground == Ground)]
        matches_played = len(Country_Ground_Record)
        if matches_played == 0:
            continue
        matches_won = len(Country_Ground_Record[Country_Ground_Record.Result == "won"])
        winning_percentage = matches_won / matches_played * 100
        Ground_Winnings[Ground].update({Team : {"matches_played" : matches_played,
                                                "matches_won" : matches_won,
                                                "winning_percentage" : winning_percentage}})
Data_Frame_Data = []
for Pitch, P_Data in Ground_Winnings.items():
    for Team, Team_Data in P_Data.items():
        inside = [Pitch, Team, Team_Data["matches_played"],
                  Team_Data["matches_won"], Team_Data["winning_percentage"]]
        Data_Frame_Data.append(inside)
Columns = ["Ground", "Country","Played","Won","Win_Percentage"]
Data_Frame_Data
Pitch_Team_Winnings = pd.DataFrame(Data_Frame_Data, columns=Columns)
Pitch_Team_Winnings.groupby(['Ground','Country']).mean()
import numpy as np
import matplotlib.pyplot as plt
category_names = [ 'lost', 'aban', 'tied', 'won']
results = {
    'Australia': [52.380952, 4.761905, 0, 23.809524],
    'Bangladesh': [50.000000, 0, 0, 25.000000],
    'England': [30.000000, 2.857143, 1.428571, 58.571429],
    'India': [27.777778, 5.555556, 0, 66.666667],
    'NewZealand': [50.000000, 0, 0, 35.714286],
    'Pakistan': [61.111111, 0, 0, 27.777778],
    'SouthAfrica': [60.000000, 0, 10.000000, 30.000000],
    'SriLanka': [52.941176, 0, 5.882353, 35.294118],
    'WestIndies': [62.500000, 0, 12.500000, 12.500000]
}
def survey(results, category_names):
    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap('RdYlGn')(np.linspace(0.15, 0.85, data.shape[1]))
    fig, ax = plt.subplots(figsize=(9.2, 5))
    ax.invert_yaxis()
    ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())
    for i, (colname, color) in enumerate(zip(category_names, category_colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        ax.barh(labels, widths, left=starts, height=0.5, label=colname, color=color)
        xcenters = starts + widths / 2
        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        for y, (x, c) in enumerate(zip(xcenters, widths)):
            ax.text(x, y, str(int(c)), ha='center', va='center', color=text_color)
    ax.legend(ncol=len(category_names), bbox_to_anchor=(0, 1),
              loc='lower left', fontsize='small')
    return fig, ax
survey(results, category_names)
plt.title('Status & Number of matches/Team',loc = 'right')
plt.show()
Also, as mentioned in Fig 2, we see that England and India are evenly matched teams. India performs slightly better based on the stats, but the fact that England will be playing at home gives them an edge over India.
However, this data alone is not enough to predict the winner of the cup. We will now analyse player data to draw some more conclusions.
Batsman = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Batsman_Data.csv')
ground_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Ground_Averages.csv')
odi_Scores_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ODI_Match_Totals.csv')
odi_Results_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/ODI_Match_Results.csv')
wc_Players_Data = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/WC_players.csv')
bowler = pd.read_csv(r'/Users/Pushkar/UMD_Files/Sem1/Python/Project/Bowler_data.csv')
WC_venue_pitches = ["The Oval, London","Trent Bridge, Nottingham","Sophia Gardens, Cardiff","County Ground, Bristol","Rose Bowl, Southampton","County Ground, Taunton","Old Trafford, Manchester","Edgbaston, Birmingham","Headingley, Leeds","Lord's, London","Riverside Ground, Chester-le-Street"]
WC_Ground_Stats = []
ODI_Grounds = odi_Scores_Data.Ground
for i in ODI_Grounds:
    for j in WC_venue_pitches:
        if i in j:
            WC_Ground_Stats.append((i, j))
stadiums_data = [item[0] for item in set(WC_Ground_Stats)]
World_Cup_Grounds=["Lord's, London - England", "The Rose Bowl, Southampton - England", "Trent Bridge, Nottingham - England", "Sophia Gardens, Cardiff - England", "Kennington Oval, London - England", "Edgbaston, Birmingham - England", "Old Trafford, Manchester - England", "Riverside Ground, Chester-le-Street - England", "Headingley, Leeds - England", "County Ground, Bristol - England", "County Ground, Taunton - England"]
World_Cup_Grounds
Worldcup_Ground_Stats = []
England_Grounds = ground_Data.Ground
for grounds in England_Grounds:
    for venues in World_Cup_Grounds:
        if grounds in venues:
            Worldcup_Ground_Stats.append(venues)
Worldcup_Ground_Stats
WorldCup_Grounds_Stats = ground_Data[ground_Data.Ground.isin(Worldcup_Ground_Stats)]
WorldCup_Grounds_Stats.head()
This data can be used to understand whether a ground favours the bowlers or the batsmen. A low-scoring ground is supposed to favour the bowlers, since not many runs are scored there, while a high-scoring ground is supposed to favour the batsmen.
Batsman.drop(columns=Batsman.columns[0],inplace=True)
Batsman = Batsman[~Batsman.Bat1.isin(["DNB","TDNB"])]
Batsman = Batsman[Batsman.Player_ID.isin(wc_Players_Data.ID)]
stadiums = [item[0] for item in set(WC_Ground_Stats)]
Batsman_in_England = Batsman[Batsman.Ground.isin(stadiums)].copy()  # copy, since we add columns to this frame below
Batsman_in_England.head()
def Out_NotOut(value):
    # A "*" marks a not-out innings (e.g. "45*"), which does not count as a dismissal
    if "*" in value:
        return 0
    else:
        return 1
Batsman_in_England["Out_NotOut"] = Batsman_in_England["Bat1"].apply(Out_NotOut)
#Batsman_in_England
Batsman_in_England["Runs"] = Batsman_in_England["Runs"].astype("int")
Batsman_in_England["BF"] = Batsman_in_England["BF"].astype("int")
Batsman_in_England["4s"] = Batsman_in_England["4s"].astype("int")
Batsman_in_England["6s"] = Batsman_in_England["6s"].astype("int")
Batsman_Data_Dummy = Batsman_in_England
Batsman_in_England = Batsman_in_England.groupby(["Ground","Batsman"]).sum().reset_index()
Batsman_in_England["Average"] = Batsman_in_England["Runs"]/Batsman_in_England.Out_NotOut
Batsman_in_England.head()
Here we look at the averages and strike rates of the batsmen. There are other columns as well, containing the number of 4s and 6s each batsman hit and how many times he remained not out during his innings on these grounds.
Batsman_Data = Batsman_in_England.groupby(["Batsman"]).sum().reset_index()
Batsman_Data["Average"] = Batsman_Data["Runs"]/Batsman_Data["Out_NotOut"]
Batsman_Data.sort_values(by = "Average",ascending=False).sample(5)
Batsman_Average_Best = Batsman_Data[(Batsman_Data.Out_NotOut>0) & (Batsman_Data.Average > 40 )].sort_values(by = "Average",ascending = False)
Batsman_Average_Best.head()
Batsman_NoDuplicate = Batsman[["Player_ID","Batsman"]].drop_duplicates()
PlayerID = list(Batsman_Average_Best.merge(Batsman_NoDuplicate,how = "left",on = "Batsman")["Player_ID_y"].astype("int"))
Batsman_Average_Best["Player_ID"] = PlayerID
wc_Players_Data.columns = ["Player", "Player_ID","Country"]
Player_Country = list(Batsman_Average_Best.merge(wc_Players_Data,how = "left",on = "Player_ID")["Country"])
Batsman_Average_Best["Country"] = Player_Country
Batsman_Average_Best.head()
# Calculation for computing the Strike Rate
Batsman_Average_Best["Strike_Rate"] = Batsman_Average_Best["Runs"]/Batsman_Average_Best["BF"]*100
Batsman_Average_Best.head(10)
Batsman_Average_Best.sort_values(["Strike_Rate"],ascending = False).head(10)
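As a sanity check on the two batting metrics used above (all numbers hypothetical):

```python
# Hypothetical career totals for one batsman on English grounds
runs, balls_faced, dismissals = 1200, 1000, 30
average = runs / dismissals               # runs per dismissal -> 40.0
strike_rate = (runs / balls_faced) * 100  # runs per 100 balls -> 120.0
```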
Hence, looking at the data above, we think that England will have an advantage in the batting department. Below, we create a visualization representing this finding.
import matplotlib.pyplot as plt
Pie_Batsmen = pd.DataFrame(Batsman_Average_Best["Country"].value_counts(), columns=["Country"])
Pie_Batsmen.index.name="Name"
plt.pie(Pie_Batsmen["Country"],labels=Pie_Batsmen.index,autopct='%1.1f%%')
plt.axis('equal')
plt.title('Percentage of Top Ranked Batsmen in a Team')
plt.show()
# Filtering the bowlers data for the matches that were played in england
bowler = bowler[bowler.Ground.isin(stadiums)]
# Removing the rows from data where the overs is blank (-)
bowler = bowler[~bowler.Overs.str.contains('-')].copy()  # copy, since we add columns to this frame below
bowler.head()
# Total number of balls bowled by the bowler
def total_balls_bowled(value):
    # Overs like "9.3" mean 9 complete overs (6 balls each) plus 3 balls
    if "." in value:
        over = value.split(".")
        return int(over[0]) * 6 + int(over[1])
    else:
        return int(value) * 6
Ground_names = dict(set(WC_Ground_Stats))
def Full_Ground_names(value):
    return Ground_names[value]
Ground_names
# Sum of the bowler stats from all the matches played on all the grounds
bowler["Balls"] = bowler.Overs.apply(total_balls_bowled)
for i in ["Runs","Mdns","Wkts","Balls"]:
bowler[i] = bowler[i].astype("float")
Bowlers_in_England = bowler.groupby(["Bowler"]).sum()[["Runs","Mdns","Wkts","Balls"]].reset_index()
Bowlers_in_England.head()
# From the sum above, we are now calculating the stats of a bowler for only the grounds in England
Bowlers_in_England["Economy"] = Bowlers_in_England.Runs * 6 /Bowlers_in_England.Balls
Bowlers_in_England["Average"] = Bowlers_in_England.Runs/ Bowlers_in_England.Wkts
Bowlers_in_England["Strike_Rate"] = Bowlers_in_England.Balls / Bowlers_in_England.Wkts
Bowlers_in_England.head()
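The over-to-balls conversion and the three bowling metrics above can be sanity-checked with hypothetical figures (the helper is repeated here so the sketch is self-contained):

```python
def total_balls_bowled(value):
    # Overs like "9.3" mean 9 complete overs (6 balls each) plus 3 balls
    if "." in value:
        over = value.split(".")
        return int(over[0]) * 6 + int(over[1])
    return int(value) * 6

balls = total_balls_bowled("9.3")  # 57 balls
runs, wkts, mdns = 45.0, 3.0, 2.0  # hypothetical runs conceded, wickets, maiden overs
economy = runs * 6 / balls         # runs conceded per over
average = runs / wkts              # runs conceded per wicket -> 15.0
strike_rate = balls / wkts         # balls bowled per wicket -> 19.0
maiden_pct = (mdns * 6) / balls * 100  # share of balls bowled in maiden overs
```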
Since we want to look at the data of only the best bowlers in each team, we remove the bowlers who have bowled fewer than 10 overs (60 balls) in England. We also delete the records of the bowlers who have taken 0 wickets.
Bowlers_in_England = Bowlers_in_England[(Bowlers_in_England.Balls > 60) & (Bowlers_in_England.Wkts > 0)]
Bowlers_in_England.head()
unique_bowler = bowler[['Player_ID','Bowler']].drop_duplicates()
Bowlers_in_England = Bowlers_in_England.merge(unique_bowler,how = "left",on = "Bowler")
wc_Players_Data.columns = ["Player", "Player_ID","Country"]
Country_Player = list(Bowlers_in_England.merge(wc_Players_Data,how = "left",on = "Player_ID")["Country"])
Bowlers_in_England["Country"] = Country_Player
Bowlers_in_England.iloc[55,-1] = "SriLanka"
Bowlers_in_England.head()
Bowlers_in_England.sort_values(by = ["Mdns"], ascending=False)[:10]
Bowlers_in_England["Percentage_Of_Maiden_Overs"] = ((Bowlers_in_England.Mdns*6)/(Bowlers_in_England.Balls))*100
Bowlers_in_England.sort_values(by = ['Percentage_Of_Maiden_Overs'], ascending = False).head(10)
Bowlers_in_England.sort_values(by = ["Average"], ascending = True).head(10)
Bowlers_in_England.sort_values(by = ["Economy"])[:10]
Bowlers_in_England.sort_values(by = ["Strike_Rate"])[:10]
aggregations = {
'Runs':'sum',
'Mdns':'sum',
'Wkts':'sum',
'Balls':'sum',
'Economy': 'mean',
'Average':'mean',
'Strike_Rate':'mean',
'Percentage_Of_Maiden_Overs':'mean'
}
Bowlers_in_England_TeamWise_Data = Bowlers_in_England.groupby('Country').agg(aggregations).reset_index()
Bowlers_in_England_TeamWise_Data
plt.figure(figsize=(15,10))
sns.boxplot(x = "Country", y = "Economy", data = Bowlers_in_England).set_title("Average Economy Rate - Team Wise")
plt.figure(figsize=(15,10))
sns.boxenplot(x = "Country", y = "Strike_Rate", data = Bowlers_in_England).set_title("Average Strike Rate - Team Wise")
plt.figure(figsize=(15,8))
g = sns.lineplot( data = Bowlers_in_England_TeamWise_Data[["Economy","Percentage_Of_Maiden_Overs"]])
g.set_xticklabels(["Australia"]+[item for item in Bowlers_in_England_TeamWise_Data.Country])
plt.title('Comparison between Economy and Perc. Of Maiden Overs')
In this analysis, we have focused on three different angles to predict the outcome of the tournament: the performance of the teams, the batsmen, and the bowlers at the World Cup venues (in England, in this case). These analyses gave us some insight into which team would have an edge in the tournament.
Traditionally, the host nation tends to start a tournament as the favourite, and the last two World Cups were also won by the host nation. The fact that the host nation plays the maximum number of matches at home gives it an edge, as it is well versed with the conditions at play.
Our approach will hold for any future World Cup as well. We would just need to update the venue details in our dataset and add the corresponding data about the teams and the venue to our data files.